This study aims to build a computer model to detect built-up land in an
identified tsunami hazard zone based on Sentinel 2A imagery using the
normalized built-up area index (NBI), urban index (UI), normalized difference
built-up index (NDBI), modified built-up index (MBI), and index-based
built-up index (IBI) algorithms, optimized with the Random Forest (RF) and
extreme gradient boosting (XGBoost) machine learning algorithms; the spatial
patterns are predicted using the ordinary kriging (OK) method. The accuracy
of the classification and optimization results was tested using Cohen's
Kappa and overall accuracy. The results show that built-up land consisting
of open land and water, settlements, industrial areas, and agricultural and
tourism areas can be identified using the built-up indices. The accuracy
tests, performed using overall accuracy and Cohen's Kappa, show that
classification and prediction are highly accurate using XGBoost, namely
>91%. The novelty of this study is a computer model that detects and
predicts the spatial distribution of built-up land on 4 scales, i.e., very
low, low, high, and very high, based on NBI, UI, NDBI, MBI, and IBI data
extracted from Sentinel 2A imagery.
1. INTRODUCTION

Currently, methodologies for tsunami vulnerability detection and assessment are advanced and
developing rapidly, ranging from linear, non-linear and numerical modeling to photogrammetric image
analysis and remote sensing [1]-[5]. Remote sensing image analysis uses medium-resolution images such as
Landsat 8 OLI and Sentinel 2A, or high-resolution images such as SPOT 5 and Quickbird [6]-[8]. Damage to
buildings caused by a tsunami can be calculated quickly thanks to various machine learning functions and
built-up indices data extracted from remote sensing imagery [9]. Machine learning methods have long been
applied to tsunami mitigation, including predicting inundation, maximum wave height and the arrival time
of tsunami waves on land, even though the uncertainty of the prediction results is very high [10]. In
Indonesia, tsunamis remain a future disaster threat, as indicated by ancient tsunami silt deposits on the
south coast of Java and by the Eurasian and Indo-Australian plates, which have the potential to cause large
earthquakes and trigger tsunami waves from the Sunda Strait to Bali [11]-[18].

In terms of seismotectonic zones in Indonesia, the coastal areas of Central Java and Yogyakarta are
included in zone B, with an intensity of tsunami events of more than 2.5 times in a period of 30-50 years.
In this zone, tsunamis are generated by two types of earthquakes, namely the subduction of the Indian Ocean
Plate under the Eurasian Plate and the pressure of the arc plate, which lies east to west north of the
islands of Bali, Lombok and Sumbawa [19]. Past data show that zone B, especially the southern seas of
Central Java and Yogyakarta, has been hit by 20 tsunamis of varying strength, recorded from before 1600
until the end of 2006 [20]. Currently, there is no computer-based modeling study that provides a model able
to detect built-up land quickly and accurately in the zone B tsunami disaster risk areas. To address this
problem, this study was conducted with the following aims: i) building a computer model to detect built-up
land from Sentinel 2A satellite imagery data; ii) classifying and optimizing the digital number (DN) data
detection process for Sentinel 2A satellite images using the Random Forest (RF) and eXtreme gradient
boosting (XGBoost) machine learning algorithms; and iii) predicting the spatial pattern of the distribution
of built-up lands using the Ordinary Kriging (OK) method. The novelty of this study is a computer model that
detects and predicts the spatial distribution of built-up land on 4 scales (very low, low, high, and very
high) based on normalized built-up area index (NBI), urban index (UI), normalized difference built-up index
(NDBI), modified built-up index (MBI), and index-based built-up index (IBI) data extracted from Sentinel 2A
imagery. The cluster K-Means algorithm (CKA) works on the Euclidean distance from a data point y to the
centroid c [21]-[23]. The Euclidean distance is formulated in (1):


$d(y, c) = \sqrt{\sum_{i=1}^{n} (y_i - c_i)^2}$ (1)
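
As an illustration of (1), the following is a minimal Python sketch of the Euclidean distance and of the centroid assignment step that a K-Means style algorithm builds on. The function names, the sample vectors and the centroids are hypothetical and are not taken from the study; the sketch does not reproduce the full CKA procedure.

```python
import math


def euclidean_distance(y, c):
    """Distance d(y, c) between a data vector y and a centroid c, as in (1)."""
    if len(y) != len(c):
        raise ValueError("y and c must have the same number of components")
    return math.sqrt(sum((yi - ci) ** 2 for yi, ci in zip(y, c)))


def nearest_centroid(y, centroids):
    """Index of the centroid closest to y (the assignment step of K-Means)."""
    distances = [euclidean_distance(y, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical two-component samples and two illustrative centroids.
centroids = [(0.0, 0.0), (3.0, 4.0)]
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(nearest_centroid((2.5, 3.5), centroids))     # 1
```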

The RF algorithm combines a non-parametric classification method with the classification and
regression tree (CART) method $\{DT(a, \Theta_r)\}_{r=1}^{T}$, where $a$ is the input data observed in vector
form and $\Theta_r$ is sample data in vector form, taken randomly from the input data set as a result of the
$\Theta_1, \Theta_2, \Theta_3, \ldots, \Theta_{r-1}$ observations. $T$ denotes the sample data used as
bootstrap training data. Pictured as a forest, the sample data are trees that are grown, selected randomly
and classified into certain nodes or classes using the CART method [24]. The number of trees grown is
represented as n-tree and the number of classes or classifiers is represented as m-try. To classify a tree
and divide it into certain nodes or classes, the Gini index function in (2) is used:


$Gi = 1 - \sum_{r=1}^{n} S_{n,r}^{2}$ (2)
where $S_{n,r}$ is the ratio of training data taken randomly from historical data, and $n$ denotes the
categories of classified nodes [25], [26]. XGBoost is a machine learning algorithm based on the concept of
a decision tree, in which the decision tree nodes are connected to one another hierarchically [27]. Each
tree contributes to building a strong classifier by forming an ensemble of weak classifiers [28]. XGBoost
is formulated in (3)-(5):


$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$ (3)
$L(\phi) = \sum_{i} l(\hat{y}_i - y_i) + \sum_{k} \Omega(f_k)$ (4)
$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}$ (5)


where $f_k$ is the additive function that represents a tree, and $K$ is the number of additive functions
combined to form the prediction. The term $(\hat{y}_i - y_i)$ is the difference between the predicted value
$\hat{y}_i$ and the target value $y_i$. $\Omega(f_k)$ is the model complexity, $T$ represents the number of
leaves in each tree and $w$ represents the leaf weights [29]. OK uses structural analysis and a variogram to
assess the weights of locations that are not observation points across the spatial field [30]. The
semivariogram used in OK is given in (6):


$\gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \left[ Z(x_i) - Z(x_i - h) \right]^{2}$ (6)


where $\gamma(h)$ is the semivariogram, $h$ is the lag distance, $N(h)$ is the number of observation point
pairs separated by distance $h$, and $Z(x_i)$ and $Z(x_i - h)$ are regionalized variables [31]. The accuracy
of the RF and XGBoost classification and optimization is assessed using the overall accuracy method and the
Cohen's Kappa method. Overall accuracy is given in (7):
$OA = \sum_{i=1}^{B} x_{ii} / n$ (7)


where B is the number of classes used in classification, x is the amount of testing data and n is the
amount of data analyzed [32]. Cohen's Kappa is given in (8):


$K = \frac{x_0 - x_e}{1 - x_e}$ (8)


where x_0 is the observed accuracy and x_e is the expected (chance) agreement probability [32]. The proposed
computer model algorithm, the novel finding of this study, is as follows:


Begin
Data Sentinel-2 ESA as Numeric = {ML, QL, AL, Mρ, Qcal, Aρ}
ASTER G-DEM as Numeric = {Elevation}


Process radiometric correction:

Step 1: Convert DN to TOA radiance: Lλ = ML Qcal + AL

Step 2: Convert DN to TOA reflectance: ρλ' = Mρ Qcal + Aρ

Calculate atmospheric correction process:

ρ*(λ) = ρr(λ) + ρa(λ) + ρra(λ) + T(λ) ρg(λ) + t(λ) ρwc(λ) + t(λ) ρBOA(λ);
Calculate geometric correction process;

Calculate Built-Up Indices process:

NBI = (ρ_Red * ρ_SWIR2)/ρ_NIR ;
UI = (ρ_SWIR2 - ρ_NIR)/(ρ_SWIR2 + ρ_NIR) ;
NDBI = (ρ_SWIR1 - ρ_NIR)/(ρ_SWIR1 + ρ_NIR) ;
MBI = (ρ_SWIR1 * ρ_Red - (ρ_NIR * ρ_NIR))/(ρ_Red + ρ_NIR + ρ_SWIR1) ;
IBI = ((2 * ρ_SWIR1)/(ρ_SWIR1 + ρ_NIR) - (ρ_NIR/(ρ_NIR + ρ_Red)) - (ρ_Green/(ρ_Green + ρ_SWIR1)))/
      ((2 * ρ_SWIR1)/(ρ_SWIR1 + ρ_NIR) + (ρ_NIR/(ρ_NIR + ρ_Red)) - (ρ_Green/(ρ_Green + ρ_SWIR1))) ;
Calculate the Random Forest algorithm process:

Gi = 1 - Σ_(r=1)^n [S_n^2,r]

Calculate the XGBoost algorithm process:

Ω(f_k) = γT + 1/2 λ‖w‖^2

Calculate the prediction accuracy process:

Overall Accuracy: OA = Σ_(i=1)^B x_ii / n ; Cohen's Kappa: K = (x_0 - x_e)/(1 - x_e)

Calculate Ordinary Kriging: γ(h) = 1/(2N(h)) Σ_(i=1)^N(h) [Z(x_i) - Z(x_i - h)]^2

Display decision matrix assessment scale: Σ = (
Confusion Matrix (CM) values:

Begin
If CM index <= 1.39
then Very Low;
else
If CM index is 1.40 - 2.39
then Low;
else
If CM index is 2.40 - 3.39
then High;
else
If CM index > 3.39
then Very High;
EndIf
End
End
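
The Gini step of the listing can be illustrated with a short Python sketch. It assumes the common reading of (2), in which the squared term is the share of each class among the samples in a node; the function name and the sample labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def gini_index(labels):
    """Gini index 1 - sum(p_r^2), where p_r is the share of class r in a node."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Hypothetical nodes: one containing a single class, one evenly mixed.
pure_node = ["high"] * 4
mixed_node = ["very low", "low", "high", "very high"]
print(gini_index(pure_node))   # 0.0
print(gini_index(mixed_node))  # 0.75
```

A value of 0 marks a node that holds a single class, and the value grows as the classes in the node become more evenly mixed, which is why a split that lowers the index is preferred.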


The computer model is proposed within a framework of 3 layers, namely: i) pre-processing,
ii) data analytics, and iii) interpretation. The pre-processing layer consists of atmospheric,
radiometric and geometric correction, image extraction using the NBI, UI, NDBI, MBI and IBI
algorithms, and the Raster Statistics for Polygon function, producing numeric built-up indices data. The
data analytics layer consists of data classification using the CKA method, classification optimization
using machine learning, and accuracy testing using Cohen's Kappa and overall accuracy. The
interpretation layer covers the spatial distribution process using OK and the classification of the output
on 4 scales, namely: very low, low, high, and very high. The proposed computer model framework, the novel
finding of this study, is shown in Figure 1.
Figure 1. Framework computer model for assessing the impact of the tsunami on built-up indices optimized
by machine learning classifier 


2. METHOD 

This study was conducted on the southern coast of Central Java Province, Indonesia. The
coordinates of the study area are: Purworejo Regency: 109°47'-110°08' E and 7°32'-7°54' S; Kebumen
Regency: 109°33'-109°50' E and 7°27'-7°50' S; Cilacap Regency: 109°-109°30' E and 7°30'-7°50' S. The
observed area includes 57 villages: 39 in Purworejo Regency, 7 in Kebumen Regency, and 11 in Cilacap
Regency. Land use in the study area is classified into 8 types, namely: rivers/ponds fisheries,
forests/vegetation, grass, mixed agriculture, built-up lands, bare lands, mixed built-up, and
scrub/shrub. The data used for this study are Sentinel 2A remote sensing images for the 2016-2021
observation period. The built-up indices equations are given in Table 1.


Table 1. The built-up indices equation used in the study 


Ref.   Built-up index                         Formula                                                              Eq.

[33]   New built-up index                     NBI = (ρ_Red * ρ_SWIR2)/ρ_NIR                                        (9)

[34]   Urban index                            UI = (ρ_SWIR2 - ρ_NIR)/(ρ_SWIR2 + ρ_NIR)                             (10)

[35]   Normalized difference built-up index   NDBI = (ρ_SWIR1 - ρ_NIR)/(ρ_SWIR1 + ρ_NIR)                           (11)

[36]   Modified built-up index                MBI = (ρ_SWIR1 * ρ_Red - (ρ_NIR * ρ_NIR))/(ρ_Red + ρ_NIR + ρ_SWIR1)  (12)

[37]   Index-based built-up index             IBI = ((2 * ρ_SWIR1)/(ρ_SWIR1 + ρ_NIR) - (ρ_NIR/(ρ_NIR + ρ_Red))     (13)
                                              - (ρ_Green/(ρ_Green + ρ_SWIR1)))/((2 * ρ_SWIR1)/(ρ_SWIR1 + ρ_NIR)
                                              + (ρ_NIR/(ρ_NIR + ρ_Red)) - (ρ_Green/(ρ_Green + ρ_SWIR1)))
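
The formulas in Table 1 can be sketched for a single pixel as follows. This is a minimal illustration, not the processing chain used in the study: the function name and the sample reflectances are hypothetical, and the inputs are assumed to be surface reflectances of the green, red, NIR, SWIR1 and SWIR2 bands (for Sentinel-2 these are commonly taken as B3, B4, B8, B11 and B12, which the text does not state). The IBI signs follow Table 1 as printed; the form usually quoted for IBI adds the green term in the denominator, so this is worth checking against [37].

```python
def built_up_indices(green, red, nir, swir1, swir2):
    """Return NBI, UI, NDBI, MBI and IBI for one pixel, following (9)-(13).

    Inputs are band reflectances; zero denominators are not handled here.
    """
    nbi = (red * swir2) / nir
    ui = (swir2 - nir) / (swir2 + nir)
    ndbi = (swir1 - nir) / (swir1 + nir)
    mbi = (swir1 * red - nir * nir) / (red + nir + swir1)

    # The three ratio terms shared by the IBI numerator and denominator.
    a = (2 * swir1) / (swir1 + nir)
    b = nir / (nir + red)
    c = green / (green + swir1)
    ibi = (a - b - c) / (a + b - c)  # signs as printed in Table 1

    return {"NBI": nbi, "UI": ui, "NDBI": ndbi, "MBI": mbi, "IBI": ibi}


# Hypothetical reflectances for a vegetated pixel (high NIR, low SWIR).
pixel = built_up_indices(green=0.1, red=0.1, nir=0.4, swir1=0.2, swir2=0.2)
for name, value in pixel.items():
    print(f"{name}: {value:.3f}")
```

With these sample values UI, NDBI, MBI and IBI come out negative, in line with the interpretation that negative values indicate vegetation or water rather than buildings.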


Coastal elevation is one of the indicators that must be analyzed to determine tsunami vulnerability, in
addition to land use and land cover characteristics. Elevation is modeled using ASTER digital elevation
model (DEM) imagery and interpreted as shown in Table 2 [38].
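
The elevation classes of Table 2 can be expressed as a small lookup, sketched below in Python. The handling of the class boundaries is an assumption, since Table 2 does not say which class an elevation of exactly 5, 10, 15 or 20 m belongs to; here each lower bound is inclusive and 20 m is treated as low. The 1-5 scores follow the elevation scale used in the decision matrix in Section 3. All names are hypothetical.

```python
# Scores for the decision matrix: 1 = very low ... 5 = very high vulnerability.
VULNERABILITY_SCORE = {
    "very low": 1,
    "low": 2,
    "medium": 3,
    "high": 4,
    "very high": 5,
}


def elevation_vulnerability(elevation_m):
    """Map an elevation in metres to the vulnerability class of Table 2.

    Boundary handling is assumed: [5, 10) high, [10, 15) medium, [15, 20] low.
    """
    if elevation_m < 5:
        return "very high"
    if elevation_m < 10:
        return "high"
    if elevation_m < 15:
        return "medium"
    if elevation_m <= 20:
        return "low"
    return "very low"


# Hypothetical elevations along a coastal transect.
for elevation in (2.0, 7.5, 12.0, 18.0, 35.0):
    label = elevation_vulnerability(elevation)
    print(elevation, label, VULNERABILITY_SCORE[label])
```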


Table 2. Elevation and level of vulnerability [38] 


Elevation (m) Vulnerability 
<5 Very high 
5-10 High 
10-15 Medium 
15-20 Low 
>20 Very low 


3. RESULT AND DISCUSSION 


The first indicator for assessing building damage as a result of the tsunami in this computer model is
the built-up area along the coast. The area of built-up land is determined by extracting, identifying and
separating the DN values of the built-up land category in Sentinel 2A imagery from the DN pixels of the
fisheries, forest and agriculture categories using the supervised classification method. This process
produces a map image in raster format, which is converted to vector format to make the land area easier to
calculate. A comparison of the land area in the built-up category over the 5-year period 2017-2021 is shown
in Figure 2.


[Figure 2 plot: line chart; x-axis: Year (2016-2022); y-axis: Width (Ha); series: Fisheries, Forest, Agriculture, Built-up]


Figure 2. Comparison of built-up land with fisheries, forest and agriculture over the 5-year period 2017-2021


Based on the analysis of Sentinel 2A imagery, the built-up area increased by 226,683 m² over the
5-year period. The increase includes the construction of settlements, offices, trade buildings,
industrial buildings, and social and public facilities in the study area. Potential damage to buildings
can be indicated using the built-up indices, namely IBI, NBI, MBI, NBAI, NDBI and UI. Each built-up index
is derived from the DN value of each pixel, which represents a built-up area based on the wavelength
reflected and received by the satellite sensor. An IBI value that increases every year indicates a change
in land function and use from vegetated or vacant land to built-up land, whether as physical buildings or
as other changes in land use. These changes can be detected because IBI is composed of 3 other indices,
namely the soil-adjusted vegetation index (SAVI), the modified normalized difference water index (MNDWI)
and NDBI [38]. Built-up indices range from -1.00 to +1.00, where negative values represent areas dominated
by surface water and vegetation and positive values represent areas dominated by physical buildings. In
Figure 3, the mean IBI of 0.080 and the mean NBI of 0.077 represent the dominance of built-up lands with a
small portion of open lands and surface waters in the study area. High values of built-up indices represent
the complexity of settlements, industry or tourism, while low values represent open lands. The mean MBI of
-0.093, NBAI of -0.241, NDBI of -0.419 and UI of -0.204 represent the dominance of surface water areas such
as paddy fields and aquaculture, and of vegetation such as shrubs, plantations and forests.

Built-up indices data are classified into 4 categories based on built-up land density indicators using
the CKA algorithm, namely very low, low, high, and very high. The purpose of this classification is to
label each observation or sampling area with a value for each built-up index. The classification results
are then tested using the RF and XGBoost machine learning algorithms.

The distribution pattern of the machine learning results is predicted using the OK method. The RF
algorithm works by forming vectors from the input data, namely the built-up indices, which are denoted as
$\Theta_1, \Theta_2, \Theta_3, \ldots, \Theta_{r-1}$. The vectors are then randomly selected, some as
training data and some as testing data, before being evaluated with the Gini index. In the RF algorithm,
the built-up indices are nodes described as trees, each of which has branches for the very low, low, high,
and very high clusters. Figure 4 compares the spatial patterns of building density in the study area
obtained with the RF algorithm for 2016 (Figure 4(a)), 2020 (Figure 4(b)), and 2021 (Figure 4(c)) based on
built-up indices data. In 2016 (Figure 4(a)), the study area has a low to very low building density, shown
in green to blue. In 2020 (Figure 4(b)), the density of built-up land is high to very high (yellow and red)
in most of the study area. In 2021 (Figure 4(c)), the density increased further compared to 2020, and most
of the study area shows a high to very high built-up density (yellow and red).
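
The OK predictions rest on the experimental semivariogram in (6). A minimal Python sketch of that quantity for points along a one-dimensional transect is given below. It covers only the experimental semivariogram, not the fitting of a variogram model or the solution of the kriging weights; the function name, the lag tolerance `tol` and the sample data are hypothetical.

```python
def empirical_semivariogram(coords, values, lag, tol):
    """Experimental semivariogram gamma(h) for one lag, as in (6).

    Averages the squared differences of all point pairs whose separation
    is within `tol` of `lag`, then halves the result.
    """
    total = 0.0
    n_pairs = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            separation = abs(coords[i] - coords[j])
            if abs(separation - lag) <= tol:
                total += (values[i] - values[j]) ** 2
                n_pairs += 1
    if n_pairs == 0:
        raise ValueError("no point pairs found at this lag")
    return total / (2 * n_pairs)


# Hypothetical index values sampled at unit spacing along a transect.
coords = [0, 1, 2, 3]
values = [1.0, 2.0, 4.0, 7.0]
print(empirical_semivariogram(coords, values, lag=1, tol=0.1))
print(empirical_semivariogram(coords, values, lag=2, tol=0.1))
```

In this sample the semivariogram is larger at lag 2 than at lag 1, meaning that values farther apart differ more, which is the spatial structure that kriging exploits when weighting neighbouring observations.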
Figure 3. Time series data of built-up indices: (a) IBI, (b) MBI, (c) NBI, (d) NBAI, (e) NDBI, and (f) UI
from 2017-2021 




Figure 4. The spatial pattern of building density in the study area using the RF algorithm for: (a) 2016,
(b) 2020, and (c) 2021


Figure 5 compares the spatial patterns of building density in the study area obtained with the
XGBoost algorithm for 2016 (Figure 5(a)), 2020 (Figure 5(b)), and 2021 (Figure 5(c)) based on
built-up indices data. The XGBoost algorithm processes structured data using a decision tree: the data
attributes (built-up indices) are tested in each node against the criteria for the very low, low, high and
very high clusters, and the test results are represented on each branch. The results of the XGBoost
analysis are almost the same as those of RF. In 2016, the study area has a low to very low building
density, shown in green to blue. In 2021, the density is higher than in 2016, and the study area has a high
to very high building density, shown in yellow to red. The accuracy of the RF and XGBoost results is tested
using 2 methods, namely overall accuracy and Cohen's Kappa.
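
Both accuracy measures can be computed from a confusion matrix, as sketched below in Python following (7) and (8). The chance agreement x_e is taken from the row and column totals, which is the usual definition of Cohen's Kappa. The function names and the 2-class matrix are hypothetical and do not reproduce the study's data.

```python
def overall_accuracy(cm):
    """Share of samples on the diagonal of a square confusion matrix."""
    n = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / n


def cohen_kappa(cm):
    """Cohen's Kappa: (x_0 - x_e) / (1 - x_e), with x_e from the marginals."""
    n = sum(sum(row) for row in cm)
    x_0 = overall_accuracy(cm)
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(cm[i][j] for i in range(len(cm))) for j in range(len(cm))]
    x_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (x_0 - x_e) / (1 - x_e)


# Hypothetical matrix: rows are reference classes, columns are predictions.
cm = [
    [40, 10],
    [10, 40],
]
print(round(overall_accuracy(cm), 3))
print(round(cohen_kappa(cm), 3))
```

Kappa is lower than overall accuracy whenever part of the agreement could arise by chance, which is consistent with the Kappa values reported for RF and XGBoost lying below their overall accuracies.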
The test results show that both RF and XGBoost achieve overall accuracy and Cohen's Kappa values
above 80%, so the classification and prediction are spatially very valid. RF has an overall accuracy of
0.913 and a Cohen's Kappa of 0.866. XGBoost has an overall accuracy of 0.947 and a Cohen's Kappa of
0.921. The final decision represented in the computer model is made using the decision matrix method,
which requires a scale for each variable (Table 3). The build-up density variable is scaled as 1 for a
very low density, 2 for a low density, 3 for a high density and 4 for a very high density. The elevation
variable is scaled as 1 for a very low vulnerability, 2 for a low vulnerability, 3 for a medium
vulnerability, 4 for a high vulnerability, and 5 for a very high vulnerability. The accuracy variable is
scaled as 1 for < 0.9 and 2 for > 0.9. Based on the decision matrix in Table 3, XGBoost is more accurate
than RF, and with both algorithms 2020 and 2021 show a very high level of tsunami vulnerability due to
their high density of buildings and a very low elevation of < 5 m above sea level. The assessment scale is
produced by dividing the sum of the variable values by the number of variables, so that it ranges from a
lowest value of 1 to a highest value of 4. Based on the experiments, the most optimal and accurate
algorithm is XGBoost because it produces an assessment scale of 3.6, a very high vulnerability area, while
the RF algorithm produces an assessment scale of 3.3, a high vulnerability area.




Figure 5. The spatial pattern of building density in the study area using the XGBoost algorithm for: (a) 2016,
(b) 2020, and (c) 2021


Table 3. The confusion matrix on computer model for a decision of tsunami high vulnerability assessment 


Algorithm   Year of data   Build-up density   Elevation   Accuracy   Assessment scale   Symbol

RF          2016           1                  5           1          2.3                Blue
RF          2020           4                  5           1          3.3                Yellow
RF          2021           4                  5           1          3.3                Yellow
XGBoost     2016           1                  5           2          2.6                Green
XGBoost     2020           4                  5           2          3.6                Red
XGBoost     2021           4                  5           2          3.6                Red
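
The assessment scale in Table 3 can be reproduced with the short Python sketch below. Two points are assumptions inferred from the table rather than stated in the text: the scale is the mean of the three variable scores, and the mean is truncated (not rounded) to one decimal, since 8/3 appears as 2.6. The label thresholds follow the listing in Section 1, with the third range read as 2.40-3.39. The function names are hypothetical.

```python
def assessment_scale(density, elevation, accuracy):
    """Mean of the three integer scores, truncated to one decimal place."""
    total = density + elevation + accuracy
    return (total * 10 // 3) / 10  # integer truncation, then one decimal


def vulnerability_label(scale):
    """Map an assessment scale to the 4-level label of the decision listing."""
    if scale <= 1.39:
        return "very low"
    if scale <= 2.39:
        return "low"
    if scale <= 3.39:
        return "high"
    return "very high"


# Rows of Table 3: (algorithm, year, density, elevation, accuracy).
rows = [
    ("RF", 2016, 1, 5, 1),
    ("RF", 2020, 4, 5, 1),
    ("XGBoost", 2016, 1, 5, 2),
    ("XGBoost", 2020, 4, 5, 2),
]
for algorithm, year, density, elevation, accuracy in rows:
    scale = assessment_scale(density, elevation, accuracy)
    print(algorithm, year, scale, vulnerability_label(scale))
```

Under these assumptions the sketch returns 3.3 (high) for RF and 3.6 (very high) for XGBoost in 2020, matching the values discussed in the text.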


4. CONCLUSION 

The results of the study show that the mean built-up index values of IBI (0.080) and NBI (0.077)
represent the dominance of built-up lands with a small portion of open lands and surface waters in the
study area. High values of built-up indices represent the complexity of settlements, industrial areas and
tourism areas, while low values represent open lands. The mean values of the built-up indices MBI
(-0.093), NBAI (-0.241), NDBI (-0.419) and UI (-0.204) represent the dominance of surface water areas
such as paddy fields and aquaculture areas, and of vegetation such as shrubs, plantations and forests. The
RF classifier achieves an overall accuracy of 0.913 and a Cohen's Kappa of 0.866, which indicates that the
classification of built-up indices data is very valid. The XGBoost classifier achieves an overall accuracy
of 0.947 and a Cohen's Kappa of 0.921, which also indicates that the classification of built-up indices
data is very valid.